Causal language modeling
There are two types of language modeling: causal and masked.
Causal language modeling predicts the next token in a sequence of tokens; the model can only attend to tokens on the left and cannot see future tokens.
Because of this left-to-right structure, causal language models such as GPT-2 are frequently used for text generation.
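As a quick illustration, GPT-2 can be loaded through the Transformers text-generation pipeline (a minimal sketch; the prompt and generation length are arbitrary examples):
code:generation_example.py
from transformers import pipeline

# Load GPT-2, a causal language model, for text generation
generator = pipeline("text-generation", model="gpt2")

# The model extends the prompt one token at a time, left to right
output = generator("Causal language models can", max_new_tokens=20)
print(output[0]["generated_text"])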
Now, as a preprocessing step, create a batch of examples using DataCollatorForLanguageModeling.
Use the end-of-sequence token as the padding token and set mlm=False, so the inputs also serve as the labels, shifted to the right by one element (the model performs this shift internally when computing the loss):
code:pytorch_example.py
from transformers import DataCollatorForLanguageModeling

# GPT-2 has no padding token by default, so reuse the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token
# mlm=False selects causal language modeling rather than masked language modeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
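To see what the collator actually produces, here is a self-contained sketch (the example sentences are arbitrary):
code:collator_example.py
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Tokenize two sentences of different lengths
examples = [tokenizer("Hello world"), tokenizer("A somewhat longer example sentence")]
batch = data_collator(examples)

print(batch["input_ids"].shape)  # both examples padded to the longest length in the batch
print(batch["labels"])           # copies of input_ids, with padded positions set to -100
Note that the labels come out as an unshifted copy of the inputs; the model applies the one-position shift when computing the loss, and positions set to -100 are ignored by the loss function, so the model is never trained to predict padding.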